Safe Policy Improvement


Towards Safe Policy Improvement for Non-Stationary MDPs

Neural Information Processing Systems

Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis. Safety is ensured using sequential hypothesis testing of a policy's forecasted performance, and confidence intervals are obtained using wild bootstrap.
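
As an illustration of the safety test described above, here is a minimal sketch (not the authors' code; the function name, the linear trend model, and all constants are our own assumptions) of forecasting a policy's performance from per-episode off-policy estimates and obtaining a lower confidence bound with a wild bootstrap over regression residuals.

    import numpy as np

    def wild_bootstrap_lower_bound(returns, horizon=1, delta=0.05, n_boot=2000, seed=None):
        """returns: per-episode off-policy performance estimates, ordered in time.
        Forecasts performance `horizon` episodes ahead and returns an approximate
        (1 - delta) lower confidence bound via the wild bootstrap (Rademacher weights)."""
        rng = np.random.default_rng(seed)
        y = np.asarray(returns, dtype=float)
        t = np.arange(len(y), dtype=float)
        X = np.column_stack([np.ones_like(t), t])          # linear time-trend basis
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares trend fit
        resid = y - X @ beta
        x_future = np.array([1.0, len(y) - 1 + horizon])   # design row at the forecast time
        point = x_future @ beta                            # point forecast
        boot = np.empty(n_boot)
        for b in range(n_boot):
            signs = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher multipliers
            beta_b, *_ = np.linalg.lstsq(X, X @ beta + signs * resid, rcond=None)
            boot[b] = x_future @ beta_b
        # One-sided bound: subtract the upper quantile of the bootstrap deviations.
        return point - np.quantile(boot - point, 1.0 - delta)

A candidate policy would then pass the safety test only if its bound exceeds the forecasted performance of the safe baseline.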


Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.
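
Hedging on the exact notation (which is ours, not the paper's), the robust baseline-regret objective described above can be written as a maximin problem over the model uncertainty set:

    \max_{\pi \in \Pi} \; \min_{\xi \in \Xi} \; \big[ \rho(\pi, \xi) - \rho(\pi_B, \xi) \big],

where \rho(\pi, \xi) denotes the return of policy \pi under transition model \xi, \Xi is the set of models consistent with the known accuracy guarantees, and \pi_B is the baseline policy. Roughly, where the model is accurate the inner minimization cannot penalize deviating from \pi_B, which is what allows local improvement while falling back to the baseline elsewhere.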


Review for NeurIPS paper: Towards Safe Policy Improvement for Non-Stationary MDPs

Neural Information Processing Systems

Summary and Contributions: In this paper, the authors introduce a novel model-free, policy-improvement-based algorithm for smooth non-stationary Markov decision processes (NS-MDPs), focusing on the safety guarantees of their method. The method relies heavily on Assumption 1 (smooth performance), implicitly assumed in [51], which enables treating off-policy evaluation (OPE) in the NS-MDP as a time-series forecasting (TSF) problem. The authors introduce safe policy improvement for NS-MDPs, termed SPIN, which, under Assumption 1, iterates between a policy evaluation step and a policy improvement step. Importance sampling is used for OPE on past episodes; TSF is then applied to estimate future performance, and wild bootstrapping is used to obtain uncertainty estimates for that forecast.
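
For concreteness, a minimal sketch of the per-episode importance-sampling estimate referred to above (our own notation and identifiers, not the paper's):

    def is_estimate(episode, pi_e, pi_b, gamma=1.0):
        """episode: list of (state, action, reward) tuples generated by the behavior policy.
        pi_e(a, s) and pi_b(a, s): action probabilities under the evaluation / behavior policy.
        Returns an unbiased estimate of the evaluation policy's expected return."""
        rho, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(episode):
            rho *= pi_e(a, s) / pi_b(a, s)   # cumulative importance weight
            g += (gamma ** t) * r            # discounted return of the logged episode
        return rho * g

One such estimate per past episode yields the time series to which the TSF and wild-bootstrap machinery is applied.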


Reviews: Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems

The paper considers the problem of robust MDPs. In particular, it assumes we have an uncertain model of the MDP's transition probabilities (with rectangular uncertainty) as well as a baseline policy. The goal is to find a new policy that is guaranteed to be no worse than the baseline; this is called safe policy improvement. A robust approach would find a policy that maximizes worst-case performance.
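
In the (informal, our own) notation used here, the standard robust approach mentioned in the last sentence solves

    \max_{\pi} \; \min_{\xi \in \Xi} \; \rho(\pi, \xi),

i.e., it maximizes worst-case performance over the rectangular uncertainty set \Xi, whereas the paper's formulation instead maximizes the worst-case improvement over the baseline, \rho(\pi, \xi) - \rho(\pi_B, \xi).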


Decision-Point Guided Safe Policy Improvement

Sharma, Abhishek, Benac, Leo, Parbhoo, Sonali, Doshi-Velez, Finale

arXiv.org Artificial Intelligence

Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challenge in SPI is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (i.e. decision points) while still utilizing data from sparsely visited states. By appropriately limiting where and how we may deviate from the behavior policy, we achieve tighter bounds than prior work; specifically, our data-dependent bounds do not scale with the size of the state and action spaces. In addition to the analysis, we demonstrate that DPRL is both safe and performant on synthetic and real datasets.
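
The following is a hypothetical sketch of the core idea of deviating only at densely visited pairs; the counting threshold, identifiers, and the greedy selection are our own assumptions, not the algorithm as published.

    from collections import Counter

    def decision_point_policy(dataset, q_hat, behavior_action, n_min=20):
        """dataset: iterable of (state, action) pairs logged by the behavior policy.
        q_hat(s, a): estimated action value; behavior_action(s): behavior policy's choice.
        Only deviates from the behavior policy where (s, a) has at least n_min visits."""
        counts = Counter((s, a) for s, a in dataset)

        def pi(s, actions):
            supported = [a for a in actions if counts[(s, a)] >= n_min]
            if not supported:
                return behavior_action(s)                 # sparse region: defer to the behavior policy
            best = max(supported, key=lambda a: q_hat(s, a))
            # Deviate only when the estimated value beats the behavior action's estimate.
            return best if q_hat(s, best) > q_hat(s, behavior_action(s)) else behavior_action(s)

        return pi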


Reviews: A Lyapunov-based Approach to Safe Reinforcement Learning

Neural Information Processing Systems

The focus is safe reinforcement learning under the constrained Markov decision process framework, where safety can be expressed as policy-dependent constraints. Two key assumptions are that (i) we have access to a safe baseline policy and (ii) this baseline policy is close enough to the unknown optimal policy under total variation distance (Assumption 1 in the paper). A key insight of the technical approach is to augment the safety constraint with a cost-shaping function in order to turn it into a Lyapunov function w.r.t. the baseline policy. Just as identifying a Lyapunov function is not trivial, this cost-shaping function is also difficult to compute, so the authors propose several approximations, including solving an LP in each loop of a policy iteration and value iteration procedure.
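
Paraphrasing the constrained setting in our own notation (a sketch, not the paper's exact statement), the CMDP asks for

    \max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r(x_t)\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} d(x_t)\Big] \le d_0,

and the cost-shaped function L is constructed to act, roughly, as a Lyapunov function for the baseline: L(x) \ge d(x) + \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x, \pi_B(x))}[L(x')] at every state, with L(x_0) \le d_0. Any policy whose one-step constraint backup also stays below L then satisfies the safety constraint, which is what makes the policy/value iteration loops safe.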


Supported Trust Region Optimization for Offline Reinforcement Learning

Mao, Yixiu, Zhang, Hongchang, Chen, Chen, Xu, Yi, Ji, Xiangyang

arXiv.org Artificial Intelligence

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.
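
A hedged sketch of what a support-constrained trust-region update can look like (our notation; the paper's exact objective may differ): at iteration k,

    \pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s, a) \big]
    \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\big) \le \delta
    \quad \text{and} \quad \pi(a \mid s) = 0 \ \text{whenever} \ \pi_\beta(a \mid s) \le \epsilon.

The trust region limits how far each update moves from the current policy, while the support constraint keeps probability mass only on actions the behavior policy \pi_\beta could plausibly have taken, avoiding the more restrictive density constraint the abstract criticizes.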


More for Less: Safe Policy Improvement With Stronger Performance Guarantees

Wienhöft, Patrick, Suilen, Marnix, Simão, Thiago D., Dubslaff, Clemens, Baier, Christel, Jansen, Nils

arXiv.org Artificial Intelligence

In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated. State-of-the-art approaches to SPI require a high number of samples to provide practical probabilistic guarantees on the improved policy's performance. We present a novel approach to the SPI problem that provides the means to require less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations on the data set and the underlying environment model that serve as theoretical foundations to derive tighter improvement bounds for SPI. Our empirical evaluation, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm, on standard benchmarks shows that our method indeed significantly reduces the sample complexity of the SPIBB algorithm.
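
As a hedged illustration of the baseline-bootstrapping constraint that SPIBB enforces (the array layout, threshold, and greedy reallocation below are our own simplifications):

    import numpy as np

    def spibb_greedy_step(pi_b, q_hat, counts, n_min=10):
        """pi_b, q_hat, counts: arrays of shape (n_states, n_actions).
        Keeps the behavior policy's probabilities on rarely visited (s, a) pairs and
        reallocates the remaining mass to the best sufficiently visited action."""
        pi = pi_b.copy()
        for s in range(pi_b.shape[0]):
            free = counts[s] >= n_min            # pairs with enough data to trust
            if not free.any():
                continue                         # nothing trustworthy here: keep the behavior policy
            movable = pi_b[s, free].sum()        # probability mass we are allowed to move
            pi[s, free] = 0.0
            best = int(np.argmax(np.where(free, q_hat[s], -np.inf)))
            pi[s, best] += movable               # concentrate the movable mass greedily
        return pi

The fewer pairs that clear the count threshold, the more the new policy stays pinned to the behavior policy; bounds that certify safety with smaller counts therefore leave more room for improvement from the same data, which is the sense in which the paper's tighter bounds reduce sample complexity.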


Safe Policy Improvement for POMDPs via Finite-State Controllers

Simão, Thiago D., Suilen, Marnix, Jansen, Nils

arXiv.org Artificial Intelligence

We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
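
A minimal sketch of the history-MDP construction described above (identifiers, the tuple layout, and the counting scheme are our own assumptions):

    from collections import defaultdict

    def build_history_mdp(episodes, fsc_next_node, initial_node=0):
        """episodes: list of trajectories, each a list of (obs, action, reward, next_obs).
        fsc_next_node(node, action, next_obs): memory update of the behavior FSC.
        States of the estimated MDP are (memory node, observation) pairs."""
        transition_counts = defaultdict(lambda: defaultdict(int))  # (state, action) -> {next_state: n}
        reward_sums = defaultdict(float)                           # (state, action) -> summed reward
        for episode in episodes:
            node = initial_node
            for obs, action, reward, next_obs in episode:
                next_node = fsc_next_node(node, action, next_obs)
                state, next_state = (node, obs), (next_node, next_obs)
                transition_counts[(state, action)][next_state] += 1
                reward_sums[(state, action)] += reward
                node = next_node
        return transition_counts, reward_sums

The resulting counts and averaged rewards would then be handed to an off-the-shelf SPI algorithm (e.g., SPIBB) exactly as if they came from a fully observable MDP, and the improved policy is converted back into an FSC for the POMDP.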